Read the data in using readxl::read_xlsx. What other ways can we read this data in?
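The function to use depends on the file type. For an Excel file, some options are -
readxl::read_excel (or readxl::read_xls for the older .xls format)
openxlsx::read.xlsx
gdata::read.xls
read.csv or read.table, but only after the sheet has been exported to a delimited text file, since these cannot parse .xlsx directly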
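The delimited-text route can be sketched with base R alone. This is a minimal illustration, not the workflow used in this report: the toy table and the temporary file path are made up, and the CSV stands in for a sheet exported from Excel.

```r
# Made-up table standing in for an exported sheet (NOT the real data).
toy <- data.frame(ID = c("s1", "s2"), tissue = c("brain", "lung"),
                  stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")  # temporary file playing the role of the export
write.csv(toy, csv_path, row.names = FALSE)

# read.csv() is read.table() with CSV-friendly defaults.
from_csv   <- read.csv(csv_path, stringsAsFactors = FALSE)
from_table <- read.table(csv_path, header = TRUE, sep = ",",
                         stringsAsFactors = FALSE)

unlink(csv_path)  # remove the temporary file
```

Both calls return the same two-row data frame; read.table just needs the header and separator spelled out.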
What type of data is in the magda_cleanedmiRNAdata.xlsx file? How is it structured?
The data show microRNA levels detected in different tissues, obtained at different gestational ages from several different foetuses.
How many “variables” are present?
There are 27 variables present, namely -
Reading data in
library(dplyr)
library(tidyr)
library(readxl)
library(tidyverse)
library(DT)
library(knitr)
library(kableExtra)
library(ggplot2)
library(rmarkdown)
raw_miRNA <- read_excel("Z:/Nikita/Projects/mirna_fetal_tissues/data/external/magda_cleanedmiRNAdata.xlsx", col_names = FALSE, sheet = "miRNA")
Variables present:
## # A tibble: 27 x 1
## ...1
## <chr>
## 1 lamID
## 2 gscID
## 3 caseID
## 4 tissue
## 5 condition
## 6 conditionEx
## 7 GA
## 8 trimester
## 9 sex
## 10 processingTime
## 11 libraryID
## 12 flowCell
## 13 Lane
## 14 indexAdapter
## 15 MLPA
## 16 misoprostol
## 17 modeLabour
## 18 medicalTA
## 19 totalReadsAligned_miRNA
## 20 propFullyIn_miRNA
## 21 propPartiallyIn_miRNA
## 22 totalReadsAligned_piRNA
## 23 propFullyIn_piRNA
## 24 propPartiallyIn_piRNA
## 25 RQS
## 26 Degraded
## 27 450k.array
How many patients are present?
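There are a total of 106 patients present, of which
45 are female
61 are male
The male count comes from a word search: MALE is matched 106 times because it also matches inside FEMALE. Hence,
106 (word: MALE) - 45 (word: FEMALE)
= 61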
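The same counts can be obtained directly in R instead of by subtraction. The sketch below uses a made-up sex vector and made-up case IDs, not the real columns. It also shows why rows and patients can differ: if several tissues came from the same foetus, distinct caseID values would be fewer than rows. Whether that applies to this dataset is an assumption, not something checked here.

```r
# Toy stand-in for the sex column (NOT the real data).
sex <- c("FEMALE", "MALE", "MALE", "FEMALE", "MALE")

# Searching for the substring "MALE" also hits "FEMALE", so every row matches.
n_substring <- sum(grepl("MALE", sex, fixed = TRUE))

# Exact comparison counts each sex separately.
n_male   <- sum(sex == "MALE")
n_female <- sum(sex == "FEMALE")

# table() gives both counts at once.
sex_counts <- table(sex)
print(sex_counts)

# Hypothetical case IDs: two rows share a foetus, so patients < rows.
caseID <- c("c1", "c1", "c2", "c3", "c4")
n_patients <- length(unique(caseID))
```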
How many samples are present?
- There are 106 total patients.
- Number of miRNAs = 2822 (total rows) - 29 (variables plus initial empty rows) = 2793
Hence, 2793 * 106 = 296,058
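The arithmetic above can be checked in R. The three input numbers are taken from the text; the variable names are only illustrative.

```r
total_rows <- 2822  # rows in the sheet, as stated above
non_mirna  <- 29    # variable rows plus initial empty rows
n_patients <- 106   # count stated above

n_mirna  <- total_rows - non_mirna  # rows that hold miRNA data
n_values <- n_mirna * n_patients    # one value per miRNA per column
```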
What is the breakdown of tissues / trimester? Generate a pretty table using kableExtra::kable(). Do this with at least 1 other variable of your choosing.
I decided to subset 5 variables -
- tissue
- extended sample condition information
- gestational age
- trimester
- sex
Summary of the variables:
##              tissue                 conditionEx                   GA
##  brain          :10   spina bifida        :27   22                :12
##  chorionic villi:32   pPROM               :22   17                : 7
##  kidney         :10   control             :18   18                : 7
##  liver          :10   anencephaly         :10   21.6              : 7
##  lung           :24   GU abnormalities    : 7   22.4              : 7
##  muscle         :10   Lipomyelomeningocele: 5   20.399999999999999: 6
##  spinal cord    :10   (Other)             :17   (Other)           :60
##  trimester     sex
##  1: 6      FEMALE:45
##  2:90      MALE  :61
##  3:10
##
##
##
##
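Since the sheet was read in with the variables as rows and the samples as columns, I transposed it and converted the result into a dataframe raw_miRNA.df:
t_raw_mirna <- t(raw_miRNA) # transpose so that each sample becomes a row
colnames(t_raw_mirna) <- t_raw_mirna[1, ] # the first row holds the variable names
t_raw_mirna <- t_raw_mirna[-1, ] # drop the row that was used as the header
raw_miRNA.df <- as.data.frame(t_raw_mirna)
raw_miRNA.df now:
1. is the transposed version of raw_miRNA
2. is a dataframe
3. has column names
4. is easier to view and understand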
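The transpose-and-rename steps can be tried on a small made-up table. This is a sketch under assumptions: the toy values are invented. The type.convert() step is an addition that the code above does not perform; it is included because t() turns every value into text, so numeric columns need to be restored afterwards.

```r
# Toy stand-in for the sheet: variable names in the first column,
# one further column per sample (values are made up).
raw <- data.frame(
  V1 = c("tissue", "trimester", "miR_a"),
  V2 = c("brain", "2", "10.5"),
  V3 = c("lung", "2", "3.25"),
  stringsAsFactors = FALSE
)

m <- t(raw)                 # character matrix, samples now in rows
colnames(m) <- m[1, ]       # promote the first row to column names
m <- m[-1, , drop = FALSE]  # drop the header row, keep the matrix shape

df <- as.data.frame(m, stringsAsFactors = FALSE)
df <- type.convert(df, as.is = TRUE)  # restore numeric columns, keep text as text

str(df)
```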
Moving all of the non-miRNA data (characteristics) into a separate df:
pDat <- raw_miRNA.df[c(1:28)] # first 28 columns hold the characteristics
colnames(pDat)[1] <- "ID"
pDat <- head(pDat, -8) # drop the last 8 rows
pDat <- as_tibble(pDat)
paged_table(pDat)
And only the miRNA expression level data into o_miRNA.df
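The code that builds o_miRNA.df is not shown, so the following is only one hypothetical way to do the split, on a made-up data frame. The column names, values and the n_meta helper are all assumptions for illustration.

```r
# Toy data frame: 2 characteristic columns followed by 2 miRNA columns
# (all names and values are made up).
df <- data.frame(
  ID     = c("s1", "s2", "s3"),
  tissue = c("brain", "lung", "liver"),
  miR_a  = c("10.5", "3.25", "0"),
  miR_b  = c("7", "8.5", "1"),
  stringsAsFactors = FALSE
)

n_meta <- 2  # number of leading characteristic columns (28 in the text above)

meta <- df[, seq_len(n_meta)]   # sample characteristics
expr <- df[, -seq_len(n_meta)]  # everything else: expression values

# Values that passed through t() are text, so convert them to numeric.
expr[] <- lapply(expr, as.numeric)
rownames(expr) <- meta$ID       # keep the sample IDs attached to the values
```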
For calculating the breakdown of tissues / trimester:
tri_tis <- pDat[c(5, 9)] # columns 5 and 9: tissue and trimester
tri_tis <- rename(count(tri_tis, trimester, tissue), Freq = n) # number of samples per trimester/tissue combination
tri_tis <- tri_tis[, c(2, 1, 3)] # reorder to tissue, trimester, Freq
tri_tis <- tri_tis %>%
  spread(trimester, Freq) %>% # one column per trimester
  replace(is.na(.), 0) # combinations with no samples become 0
names(tri_tis) <- c("Tissue", "Trimester_1", "Trimester_2", "Trimester_3")
#tri_tis %>%
# mutate (Tissue = cell_spec(Tissue, color = "pink", background ="black")) %>%
kable(tri_tis, align = "c") %>%
  kable_styling(bootstrap_options = "striped", position = "center", font_size = 12) # striped rows, centred table, 12 pt font
| Tissue | Trimester_1 | Trimester_2 | Trimester_3 |
|---|---|---|---|
| brain | 0 | 10 | 0 |
| chorionic villi | 6 | 16 | 10 |
| kidney | 0 | 10 | 0 |
| liver | 0 | 10 | 0 |
| lung | 0 | 24 | 0 |
| muscle | 0 | 10 | 0 |
| spinal cord | 0 | 10 | 0 |
All the tissues except for chorionic villi contained samples from only trimester 2 pregnancies.
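The same kind of breakdown can also be produced with base R's table(), which fills empty combinations with 0 without a separate replace step. This is an alternative sketch on made-up samples, not the method used above.

```r
# Toy sample sheet (made-up values, NOT the real data).
samples <- data.frame(
  tissue    = c("brain", "lung", "chorionic villi", "chorionic villi", "lung"),
  trimester = c("2", "2", "1", "3", "2"),
  stringsAsFactors = FALSE
)

# Cross-tabulate; combinations with no samples are 0 automatically.
tab <- table(Tissue = samples$tissue, Trimester = samples$trimester)

# Convert to a plain data frame with one column per trimester.
wide <- as.data.frame(unclass(tab))
names(wide) <- paste0("Trimester_", names(wide))
print(wide)
```

The resulting data frame keeps the tissues as row names and could be passed to kable() in the same way as tri_tis.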
Showing the above as a bar plot:
ggplot(data = pDat, aes(x = trimester, fill = tissue)) + # max no. of samples without specified y-axis length
  labs(title = "Tissues per Trimester", subtitle = "Colors denote count of tissue", x = "Trimester", y = "No. of samples") +
  # scale_fill_brewer(palette = 11) + # to add colours from a shaded palette, only works for multiple variables
  geom_bar(stat = "count", width = 0.8) +
  coord_cartesian(ylim = c(0, 100))